GeneMark-RC, a Recursive Procedure for Gene Identification in the Genomic Sequence Data with Self-Consistency Evaluation; Its Application to the Analysis of Several Prokaryotic Genomes

Authors

  • Makoto Hirosawa
  • Katsumi Isono
Abstract

Previously, we developed a GeneMark-based procedure, termed GeneMark-RC, and applied it for the identification and classification of ORFs in genomic sequence data, and identified and characterized ORFs in the 1.0 Mb data of the cyanobacterium Synechocystis sp. strain PCC 6803. In the present study, we have improved the procedure and performed analysis of the whole genomic data of Synechocystis. Consequently, we noticed the presence of three distinct classes of ORFs in this organism. The prediction of ORFs by the class-specific GeneMark-RC analysis agreed with 97.9% of those described for this bacterium. Moreover, 124 additional ORFs were identified. The procedure was similarly applied to the genomic analysis of five other prokaryotes, and 2 to 3 classes of ORFs were recognized in each case. Common features were found among the ORFs identified in the six organisms including Synechocystis. Class 1 is composed of most typical ORFs whose GC content is slightly higher than the average, while Class 2 is composed of ORFs with GC contents lower than the average. It was found that ORFs of one species can be detected with the GeneMark-RC parameters obtained from other organisms, and the prediction rate is high when the difference in their GC contents is small. It was also found that ORFs of three species with relatively low GC contents can be nicely detected with the Synechocystis matrices of Class 2 ORFs whose GC content is similar to that of the three species. Therefore, although there are two to three classes of ORFs in each species, their di-codon statistics must be rather similar to each other if their GC contents are similar. A notable exception was the case of Methanococcus jannaschii, which might reflect the fact that it is an archaebacterium.
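
The abstract compares ORF classes by two quantities, GC content and di-codon statistics. The following is a minimal Python sketch of how those two quantities can be computed for a single ORF. It is an illustration only and not the GeneMark-RC implementation; the function names gc_content and dicodon_frequencies are hypothetical, and the sketch assumes a di-codon is an in-frame pair of adjacent codons (a hexamer) counted with a step of one codon.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def dicodon_frequencies(orf: str) -> dict[str, float]:
    """Return relative frequencies of in-frame di-codons in an ORF.

    Assumption: the ORF starts in frame at position 0. Trailing bases
    that do not complete a codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    # Slide one codon at a time so adjacent di-codons overlap by one codon.
    counts = Counter(orf[i:i + 6] for i in range(0, usable - 5, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hexamer: n / total for hexamer, n in counts.items()}
```

For example, gc_content("ATGC") gives 0.5, and dicodon_frequencies("ATGAAATTT") gives equal weight to the two di-codons ATGAAA and AAATTT. Comparing such frequency tables between ORF sets with similar GC content is one simple way to explore the similarity the abstract describes.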

Similar resources

A Novel Application of GeneMark-RC to the Analysis of Prokaryotic Genomes and Human cDNAs: Sequence Data with Statistical Deviations Are Rich in Important Biological Information

Assignment of coding regions is the first step in genome sequence analysis. Although the complete genome sequences of sixteen organisms have already been published, the strategies adopted for coding region assignment are different from one organism to another, and consequently the data are not readily suited for direct comparative analysis. Evidently, a more unified method for defining coding r...

GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses

The task of gene identification frequently confronting researchers working with both novel and well studied genomes can be conveniently and reliably solved with the help of the GeneMark web software (http://opal.biology.gatech.edu/GeneMark/). The website provides interfaces to the GeneMark family of programs designed and tuned for gene prediction in prokaryotic, eukaryotic and viral genomic seq...

Prediction Rate of Coding Regions is Enhanced upto 99.15 % by Joint Use of GeneMark-RC and GeneHacker in Case of a Cyanobacterium

The advancement in large-scale sequencing has accelerated the production of long contiguous nucleotide sequence data. The whole genomic sequence data is currently available for several prokaryotic organisms. The first step in the analysis of genomic sequence data is to assign coding regions, which is absolutely necessary for a comparative study of one organism with the others and to elucidate com...

Computer survey for likely genes in the one megabase contiguous genomic sequence data of Synechocystis sp. strain PCC6803.

Using the computer program GeneMark, the open reading frames (ORFs) previously assigned within the one megabase sequence data of the genome of the cyanobacterium, Synechocystis sp. strain PCC6803 (Kaneko et al., DNA Res. 2: 153-166, 1995), were re-examined. Matrices required by GeneMark for its statistical calculation were generated and modified by running a script termed GeneMark-Genesis that ...

Comparative bioinformatics analysis of a wild diploid Gossypium with two cultivated allotetraploid species

Background: Gossypium thurberi is a wild diploid species that has been used to improve cultivated allotetraploid cotton. G. thurberi belongs to the D genome, which is an important wild bio-source for cotton breeding and genetic research. To a certain degree, chloroplast DNA sequence information is a versatile tool for species identification and phylogenetic implications in plants. Different ch...

Publication date: 1998